# arm

## AI/ML with Arm HPC

Centre for Development of Advanced Computing (C-DAC) / National Supercomputing Mission (NSM)

Arm in HPC Course

**Phil Ridley** 

phil.ridley@arm.com

3<sup>rd</sup> March 2021

## Agenda

- Containers
- ML and AI
  - Processor developments
  - Community support
  - Libraries and Applications
- ISA Developments


## Arm enables containerization through standardization

Ensuring standard interfaces work on Arm enables multiple technologies

#### Approach




## Arm & Docker Partner to Deliver Frictionless Cloud-Native Software Development







The initial phase focuses on integrating Arm capabilities into Docker Desktop Community to enable a seamless developer environment



Docker Enterprise Engine for Amazon EC2 A1 instances



Additional work will address end-to-end management of the full product life cycle, unified development environments for heterogeneous compute, and scaling cloud-native benefits to consolidate edge workloads



## **Docker on Arm**

Docker Desktop is the de facto standard Cloud Native development platform for containerized applications






This partnership makes it easier for millions of developers already using Docker to develop containers on Arm

- 5,386,145 base images on Docker Hub
- 51,460 Arm images
- 46,167 Arm64 images
- 166 official Docker images
- Arm support in 118 out of 166 official images





No changes needed to Docker tooling & processes in order to start building for Arm
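As a rough illustration of how the same tooling targets Arm, the sketch below maps a host machine name to the platform string Docker uses and assembles a multi-platform `docker buildx build` command line. The helper names `docker_platform` and `buildx_command` and the mapping table are hypothetical and not part of the original slides; the table is illustrative rather than exhaustive.

```python
import platform

# Illustrative (not exhaustive) mapping from platform.machine() values
# to the platform strings used by Docker. This table is an assumption
# made for the example, not something taken from the slides.
_DOCKER_PLATFORMS = {
    "aarch64": "linux/arm64",
    "arm64": "linux/arm64",
    "armv7l": "linux/arm/v7",
    "x86_64": "linux/amd64",
    "amd64": "linux/amd64",
}


def docker_platform(machine=None):
    """Return the Docker platform string for a machine name, or None if unknown."""
    if machine is None:
        machine = platform.machine()
    return _DOCKER_PLATFORMS.get(machine.lower())


def buildx_command(image_tag, platforms):
    """Assemble a multi-platform `docker buildx build` command as an argument list."""
    return [
        "docker", "buildx", "build",
        "--platform", ",".join(platforms),
        "-t", image_tag,
        ".",
    ]


if __name__ == "__main__":
    print("Host platform:", docker_platform())
    print(" ".join(buildx_command("demo:latest", ["linux/amd64", "linux/arm64"])))
```

The command is only printed here, not executed, so the sketch runs on a machine without Docker installed.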



# Machine Learning







## Increasing ML performance over CPU generations

#### Int8 GEMM kernel performance (normalized to A72)



#### A72

2x ML performance improvement over Cortex-A53

#### **Helios**

More than 3x ML performance improvement over Cortex-A53 (first multi-threaded CPU)

#### **N1**

More than 5x ML performance improvement over Cortex-A72 (PPA leadership and ML enhancements)

#### Zeus

More than 25x ML performance improvement over Cortex-A72 (breakthrough ML performance)
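For readers unfamiliar with the Int8 GEMM kernel used as the benchmark above, the following is a minimal reference sketch in plain Python. It assumes row-major lists of lists and is not the kernel Arm measured; the function name `gemm_int8` is hypothetical. The point it illustrates is that products of 8-bit values do not fit in 8 bits, so the accumulator must be wider (typically 32-bit in optimized kernels).

```python
def gemm_int8(a, b):
    """Multiply int8 matrices a (m x k) and b (k x n) with a wide accumulator.

    Inputs are lists of lists holding integers in [-128, 127]. Python
    integers are unbounded, which stands in for the 32-bit accumulator
    an optimized kernel would use.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    for row in a + b:
        if any(not -128 <= v <= 127 for v in row):
            raise ValueError("value outside the int8 range")

    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0  # wide accumulator
            for p in range(k):
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c


if __name__ == "__main__":
    a = [[127, -128], [1, 2]]
    b = [[127, 3], [-128, 4]]
    print(gemm_int8(a, b))
```

In the example, the top-left result is 127*127 + 128*128 = 32513, far outside the int8 range, which is why the accumulation width matters.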

## **On-CPU ML processing**



## **On-CPU Machine Learning**

## Easy to use, high-performing ML software stack on AArch64 using ML-specific CPU features



# Artificial Intelligence


## Machine Learning and Artificial Intelligence



## ML Frameworks on server-class AArch64 platforms

- Recent effort to enable server-scale on-CPU ML workloads on AArch64
- Build guides for key frameworks available:
  - TensorFlow <u>https://gitlab.com/arm-hpc/packages/wikis/packages/tensorflow</u>
  - PyTorch <u>https://gitlab.com/arm-hpc/packages/wikis/packages/pytorch</u>
  - MXNet <u>https://gitlab.com/arm-hpc/packages/wikis/packages/mxnet</u>
  - And guides for key dependencies: CPython, NumPy, etc.
- Currently focusing on inference problems
- MLPerf (<u>https://mlperf.org</u>) for realistic workloads



## TensorFlow and maths libraries: on AArch64



- Arm Performance Libraries
  - Micro-architecture optimized
  - Targeting server class cores
  - High release cadence
- GEMMs at the core of matmul and convolutions
- Leveraging ArmPL has the potential to deliver optimal performance in key kernels for on-CPU, server-scale ML workloads
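To make the link between convolutions and GEMM concrete, here is a minimal sketch of the common im2col approach: every input patch is unrolled into a row, after which the convolution becomes a single matrix multiplication that an optimized GEMM could execute. It assumes a single-channel image, stride 1, and no padding; the function names are hypothetical, and the plain-Python `matmul` merely stands in for a tuned library routine. It does not reflect how TensorFlow or ArmPL are implemented internally.

```python
def matmul(a, b):
    """Plain matrix multiply; stands in for an optimized GEMM call."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def im2col(image, kh, kw):
    """Unroll every kh x kw patch of the image into one row."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append([image[i + di][j + dj] for di in range(kh) for dj in range(kw)])
    return rows


def conv2d_as_gemm(image, kernel):
    """2D convolution (ML convention, no kernel flip) expressed as one GEMM."""
    kh, kw = len(kernel), len(kernel[0])
    out_w = len(image[0]) - kw + 1
    patches = im2col(image, kh, kw)                  # (out_h*out_w) x (kh*kw)
    weights = [[v] for row in kernel for v in row]   # (kh*kw) x 1
    flat = [r[0] for r in matmul(patches, weights)]
    return [flat[i:i + out_w] for i in range(0, len(flat), out_w)]


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    kernel = [[1, 0], [0, 1]]
    print(conv2d_as_gemm(image, kernel))
```

With several output channels the weight matrix simply gains one column per filter, so the GEMM grows while the surrounding code stays the same.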

## Where to optimise: TensorFlow and its backend




## New Data Type Support: BFloat16

- New addition to Armv8.6-A
  - Adds support for BF16
- Instructions for NEON and SVE
  - Including:
    - BFDOT: Dot Product (1x2)x(2x1)
    - BFMMLA: Mat Multiply (2x4)x(4x2)
- Significant performance gains
  - ML training and inference workloads
- Supported in Arm libraries
  - Arm NN and Arm Compute Library
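The sketch below illustrates the BF16 format and the (1x2)x(2x1) shape of a BFDOT-style operation. BF16 keeps the upper 16 bits of an FP32 value (sign, 8 exponent bits, 7 mantissa bits). The conversion here uses round-to-nearest-even, which is an assumption made for the example; the function names are hypothetical, and the code models the arithmetic shape only, not the exact rounding behaviour of the hardware instruction.

```python
import struct


def float_to_bf16_bits(x):
    """Convert a float to BF16 bits via FP32, rounding to nearest even."""
    if x != x:                      # NaN: return a canonical quiet NaN
        return 0x7FC0
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding) >> 16) & 0xFFFF


def bf16_bits_to_float(b):
    """Expand 16 BF16 bits back to a float by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


def to_bf16(x):
    """Round a float to the nearest value representable in BF16."""
    return bf16_bits_to_float(float_to_bf16_bits(x))


def bfdot(acc, a_pair, b_pair):
    """(1x2)x(2x1): add a two-element BF16 dot product to an accumulator.

    The accumulator is a Python float (double precision) here, whereas
    the instruction accumulates in FP32.
    """
    a0, a1 = (to_bf16(v) for v in a_pair)
    b0, b1 = (to_bf16(v) for v in b_pair)
    return acc + a0 * b0 + a1 * b1


if __name__ == "__main__":
    print(hex(float_to_bf16_bits(1.0)))
    print(to_bf16(3.14159))
    print(bfdot(0.0, (1.0, 2.0), (3.0, 4.0)))
```

Because only 7 mantissa bits survive, 3.14159 becomes 3.140625 after conversion, which shows the precision traded away for halving the storage of FP32.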



## FMMLA: High Performance Matrix Multiplication

- Added to Armv8.6
  - NEON support for INT and BF16
  - FMMLA instructions for FP (SVE)

`FMMLA <Zda>.S, <Zn>.S, <Zm>.S`

`FMMLA <Zda>.D, <Zn>.D, <Zm>.D`

- 2x2 matrix multiplication
  - Works on multiples of vector granules
  - 2x2xFP32 = 128-bit granules
  - Assumes the vector length is a multiple of the granule size
- May require layout transformations
  - Do the transformation in an outer loop to amortize its cost
- Will accelerate maths libraries
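The per-granule behaviour described above can be modelled in a few lines of Python. This is a sketch under stated assumptions: vectors are flat lists of floats, each group of four elements is one 128-bit granule holding a row-major 2x2 FP32 matrix, and the second operand is assumed to be held transposed (indexed `[j][k]`), which is my reading of the Arm architecture pseudocode and should be checked against the reference manual. The names `fmmla` and `pack_b_transposed` are hypothetical.

```python
def fmmla(zda, zn, zm):
    """Model FMMLA on FP32 vectors given as flat lists (length multiple of 4).

    For each 4-element granule: C[i][j] += sum_k A[i][k] * Bt[j][k],
    where Bt is the second operand as stored (assumed transposed).
    """
    if not (len(zda) == len(zn) == len(zm)) or len(zda) % 4 != 0:
        raise ValueError("vectors must share a length that is a multiple of 4")
    result = list(zda)
    for base in range(0, len(zda), 4):       # one granule per iteration
        for i in range(2):
            for j in range(2):
                acc = zda[base + 2 * i + j]
                for k in range(2):
                    acc += zn[base + 2 * i + k] * zm[base + 2 * j + k]
                result[base + 2 * i + j] = acc
    return result


def pack_b_transposed(b):
    """Layout transformation: store a row-major 2x2 matrix column by column."""
    return [b[0][0], b[1][0], b[0][1], b[1][1]]


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0]                          # A = [[1, 2], [3, 4]]
    b = pack_b_transposed([[5.0, 7.0], [6.0, 8.0]])   # B = [[5, 7], [6, 8]]
    print(fmmla([0.0] * 4, a, b))                     # flattened A x B
```

The `pack_b_transposed` helper shows the kind of layout transformation the slide refers to: in a real kernel it would be applied once per block outside the inner loop, so its cost is spread over many multiply-accumulate operations.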




## Thank You

- Danke
- Gracias
- 谢谢
- ありがとう
- Asante
- Merci
- 감사합니다
- धन्यवाद
- Kiitos
- شکرًا
- ধন্যবাদ
- תודה

© 2021 Arm

The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.

www.arm.com/company/policies/trademarks
